
[ML] Stop cross-validation early if the parameters have high predicted test loss #915

Merged
merged 13 commits into elastic:master from early-stopping-cv on Jan 10, 2020

Conversation

@tveasey (Contributor) commented Dec 19, 2019

This makes two changes:

  1. It sets the number of folds to use based on the amount of data available: we ensure each training fold has sufficient data per feature.
  2. During hyperparameter optimisation, we use a linear model to predict the cross-validation loss on the remaining folds and stop early if the predicted test loss has a low chance of being less than the best value found so far (sketched below).
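
A minimal, hypothetical C++ sketch of the early-stopping test in change 2. It summarises the per-fold losses seen so far by their mean and standard error, whereas the PR fits a linear model; the names (stopCrossValidationEarly, foldLosses, bestLossSoFar) are illustrative and are not identifiers from CBoostedTreeImpl.cc.

#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch, not the CBoostedTreeImpl.cc implementation: decide
// whether to skip the remaining folds for the current hyperparameter
// candidate. The PR fits a linear model to predict the loss on the folds
// not yet evaluated; this sketch simply extrapolates from the mean and
// standard error of the fold losses computed so far.
bool stopCrossValidationEarly(const std::vector<double>& foldLosses,
                              double bestLossSoFar,
                              double sigma = 3.0) {
    std::size_t n{foldLosses.size()};
    if (n < 2) {
        return false; // Too little evidence to extrapolate.
    }
    double mean{std::accumulate(foldLosses.begin(), foldLosses.end(), 0.0) /
                static_cast<double>(n)};
    double variance{0.0};
    for (double loss : foldLosses) {
        variance += (loss - mean) * (loss - mean);
    }
    variance /= static_cast<double>(n - 1);
    // Stop early if even an optimistic prediction of the test loss (mean
    // minus a few standard errors) is unlikely to beat the best loss so far.
    double optimisticLoss{mean - sigma * std::sqrt(variance / static_cast<double>(n))};
    return optimisticLoss > bestLossSoFar;
}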

@valeriy42 (Contributor) left a comment

Looks really good. Since some of the code here is very sophisticated, I've left a few comments to make it easier to understand.

Review comments (all outdated and resolved):
lib/maths/CBoostedTreeFactory.cc (3 comments)
lib/maths/CBoostedTreeImpl.cc (4 comments)
@tveasey (Contributor, Author) commented Jan 9, 2020

Thanks for the review @valeriy42. I think I've now addressed all your comments.

@valeriy42 (Contributor) left a comment

LGTM. Good job on explaining complex algorithmic bits.

@@ -46,7 +46,10 @@ const double MIN_DOWNSAMPLE_LINE_SEARCH_RANGE{2.0};
const double MAX_DOWNSAMPLE_LINE_SEARCH_RANGE{144.0};
const double MIN_DOWNSAMPLE_FACTOR_SCALE{0.3};
const double MAX_DOWNSAMPLE_FACTOR_SCALE{3.0};
const std::size_t MAX_NUMBER_FOLDS{5};
// This isn't a hard limit be we increase the number of default training folds

Contributor: I think something is wrong with this sentence.

Comment on lines +268 to +272
// So how does the following work: we'd like "c * f * # rows" training rows.
// For k folds we'll have "(1 - 1 / k) * # rows" training rows. So we want
// to find the smallest integer k s.t. c * f * # rows <= (1 - 1 / k) * # rows.
// This gives k = ceil(1 / (1 - c * f)). However, we also upper bound this
// by MAX_NUMBER_FOLDS.

Contributor: This is a very nice explanation!
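
As an illustrative sketch of that calculation (the function name and parameters are mine, not the actual code in CBoostedTreeFactory.cc), with c * f folded into a single trainFraction argument:

#include <algorithm>
#include <cmath>
#include <cstddef>

// Smallest integer k with (1 - 1/k) * # rows >= trainFraction * # rows,
// i.e. k = ceil(1 / (1 - trainFraction)), bounded above by maxNumberFolds.
std::size_t numberFolds(double trainFraction, std::size_t maxNumberFolds) {
    if (trainFraction >= 1.0) {
        return maxNumberFolds; // Constraint can't be met; fall back to the cap.
    }
    auto k = static_cast<std::size_t>(std::ceil(1.0 / (1.0 - trainFraction)));
    // Requiring at least 2 folds is an assumption of this sketch.
    return std::min(std::max(k, std::size_t{2}), maxNumberFolds);
}

For example, trainFraction = 0.75 gives k = ceil(1 / 0.25) = 4 folds, while trainFraction = 0.9 would give k = 10, which a MAX_NUMBER_FOLDS cap of 5 reduces to 5.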

@tveasey merged commit c2c436c into elastic:master on Jan 10, 2020
@tveasey deleted the early-stopping-cv branch on January 10, 2020 at 20:37
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Jan 13, 2020
tveasey added a commit that referenced this pull request Jan 13, 2020